Why is Posterior Sampling Better than Optimism for Reinforcement Learning?

Authors

  • Ian Osband
  • Benjamin Van Roy
Abstract

Computational results demonstrate that posterior sampling for reinforcement learning (PSRL) dramatically outperforms existing algorithms driven by optimism, such as UCRL2. We provide insight into the extent of this performance boost and the phenomenon that drives it. We leverage this insight to establish an $\tilde{O}(H\sqrt{SAT})$ Bayesian regret bound for PSRL in finite-horizon episodic Markov decision processes. This improves upon the best previous Bayesian regret bound of $\tilde{O}(HS\sqrt{AT})$ for any reinforcement learning algorithm. Our theoretical results are supported by extensive empirical evaluation.
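
To make the size of the improvement concrete, the two bounds quoted in the abstract can be divided term by term. This is only a reading of the stated rates, taking S as the number of states, A the number of actions, H the horizon and T the elapsed time, and ignoring the logarithmic factors hidden by the tilde notation:

\[
\frac{HS\sqrt{AT}}{H\sqrt{SAT}} = \frac{S}{\sqrt{S}} = \sqrt{S}
\]

Under that reading, the new bound is tighter by a factor of $\sqrt{S}$; for example, with S = 100 states the earlier bound is larger by a factor of 10.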

Similar resources

(More) Efficient Reinforcement Learning via Posterior Sampling

Most provably-efficient reinforcement learning algorithms introduce optimism about poorly-understood states and actions to encourage exploration. We study an alternative approach for efficient exploration: posterior sampling for reinforcement learning (PSRL). This algorithm proceeds in repeated episodes of known duration. At the start of each episode, PSRL updates a prior distribution over Mark...

The End of Optimism

Stochastic linear bandits are a natural and simple generalisation of finite-armed bandits with numerous practical applications. Current approaches focus on generalising existing techniques for finite-armed bandits, notably the optimism principle and Thompson sampling. Prior analysis has mostly focussed on the worst-case setting. We analyse the asymptotic regret and show matching upper and lower...

An Optimistic Posterior Sampling Strategy for Bayesian Reinforcement Learning

We consider the problem of decision making in the context of unknown Markov decision processes with finite state and action spaces. In a Bayesian reinforcement learning framework, we propose an optimistic posterior sampling strategy based on the maximization of state-action value functions of MDPs sampled from the posterior. First experiments are promising. Introduction. The design of algorithm...

Thompson Sampling for Linear-Quadratic Control Problems

We consider the exploration-exploitation tradeoff in linear quadratic (LQ) control problems, where the state dynamics is linear and the cost function is quadratic in states and controls. We analyze the regret of Thompson sampling (TS) (a.k.a. posterior-sampling for reinforcement learning) in the frequentist setting, i.e., when the parameters characterizing the LQ dynamics are fixed. Despite the...

Near-optimal Reinforcement Learning in Factored MDPs

Any learning algorithm over Markov decision processes (MDPs) will have worst-case regret $\Omega(\sqrt{SAT})$ where T is the elapsed time and S and A are the cardinalities of the state and action spaces. In many settings of interest S and A may be so huge that it is impossible to guarantee good performance for an arbitrary MDP on any practical timeframe T. We show that, if we know the true system can be...

Publication date: 2017